{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "punL79CN7Ox6"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "_ckMIh7O7s6D"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "QrxSyyyhygUR"
      },
      "source": [
        "# Tweaking the Model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "S5Uhzt6vVIB2"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l09c05_nlp_tweaking_the_model.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l09c05_nlp_tweaking_the_model.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xiWacy71Cu54"
      },
      "source": [
        "In this colab, you'll investigate how various tweaks to data processing and the model itself can impact results. At the end, you'll once again be able to visualize how the network sees the related sentiment of each word in the dataset."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hY-fjvwfy2P9"
      },
      "source": [
        "## Import TensorFlow and related functions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "drsUfVVXyxJl"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "\n",
        "from tensorflow.keras.preprocessing.text import Tokenizer\n",
        "from tensorflow.keras.preprocessing.sequence import pad_sequences"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ZIf1N46jy6Ed"
      },
      "source": [
        "## Get the dataset\n",
        "\n",
        "We'll once again use the dataset containing Amazon and Yelp reviews. This dataset was originally extracted from [here](https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "m83g42sJzGO0"
      },
      "outputs": [],
      "source": [
        "!wget --no-check-certificate \\\n",
        "    https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P \\\n",
        "    -O /tmp/sentiment.csv"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "y4e6GG2CzJUq"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "dataset = pd.read_csv('/tmp/sentiment.csv')\n",
        "\n",
        "sentences = dataset['text'].tolist()\n",
        "labels = dataset['sentiment'].tolist()\n",
        "\n",
        "# Separate out the sentences and labels into training and test sets\n",
        "training_size = int(len(sentences) * 0.8)\n",
        "\n",
        "training_sentences = sentences[0:training_size]\n",
        "testing_sentences = sentences[training_size:]\n",
        "training_labels = labels[0:training_size]\n",
        "testing_labels = labels[training_size:]\n",
        "\n",
        "# Make labels into numpy arrays for use with the network later\n",
        "training_labels_final = np.array(training_labels)\n",
        "testing_labels_final = np.array(testing_labels)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "drDkTFMuzW6N"
      },
      "source": [
        "## Tokenize the dataset (with tweaks!)\n",
        "\n",
        "Now, we'll tokenize the dataset, but we can make some changes to this from before. Previously, we used: \n",
        "```\n",
        "vocab_size = 1000\n",
        "embedding_dim = 16\n",
        "max_length = 100\n",
        "trunc_type='post'\n",
        "padding_type='post'\n",
        "```\n",
        "\n",
        "How might changing the `vocab_size`, `embedding_dim` or `max_length` affect how the model performs?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "hjPUJFhQzuee"
      },
      "outputs": [],
      "source": [
        "vocab_size = 500\n",
        "embedding_dim = 16\n",
        "max_length = 50\n",
        "trunc_type='post'\n",
        "padding_type='post'\n",
        "oov_tok = \"<OOV>\"\n",
        "\n",
        "tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)\n",
        "tokenizer.fit_on_texts(training_sentences)\n",
        "word_index = tokenizer.word_index\n",
        "training_sequences = tokenizer.texts_to_sequences(training_sentences)\n",
        "training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)\n",
        "\n",
        "testing_sequences = tokenizer.texts_to_sequences(testing_sentences)\n",
        "testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "FwFjO1kg0UUK"
      },
      "source": [
        "## Train a Sentiment Model (with tweaks!)\n",
        "\n",
        "We'll use a slightly different model here, using `GlobalAveragePooling1D` instead of `Flatten()`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "ectP92fl0dFO"
      },
      "outputs": [],
      "source": [
        "model = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.GlobalAveragePooling1D(),\n",
        "    tf.keras.layers.Dense(6, activation='relu'),\n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
        "model.summary()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "7TQIaGjs073w"
      },
      "outputs": [],
      "source": [
        "num_epochs = 30\n",
        "history = model.fit(training_padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "alAlYort7gWV"
      },
      "source": [
        "## Visualize the training graph\n",
        "\n",
        "You can use the code below to visualize the training and validation accuracy while you try out different tweaks to the hyperparameters and model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "o9l5vBeU71vH"
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n",
        "\n",
        "\n",
        "def plot_graphs(history, string):\n",
        "  plt.plot(history.history[string])\n",
        "  plt.plot(history.history['val_'+string])\n",
        "  plt.xlabel(\"Epochs\")\n",
        "  plt.ylabel(string)\n",
        "  plt.legend([string, 'val_'+string])\n",
        "  plt.show()\n",
        "  \n",
        "plot_graphs(history, \"accuracy\")\n",
        "plot_graphs(history, \"loss\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "SZzXE-pT8K57"
      },
      "source": [
        "## Get files for visualizing the network\n",
        "\n",
        "The code below will download two files for visualizing how your network \"sees\" the sentiment related to each word. Head to http://projector.tensorflow.org/ and load these files, then click the checkbox to \"sphereize\" the data.\n",
        "\n",
        "Note: You may run into errors with the projection if your `vocab_size` earlier was larger than the actual number of words in the vocabulary, in which case you'll need to decrease this variable and re-train in order to visualize."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "2Ex4o7Lc8Njl"
      },
      "outputs": [],
      "source": [
        "# First get the weights of the embedding layer\n",
        "e = model.layers[0]\n",
        "weights = e.get_weights()[0]\n",
        "print(weights.shape) # shape: (vocab_size, embedding_dim)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "bUL1zk5p8WIV"
      },
      "outputs": [],
      "source": [
        "import io\n",
        "\n",
        "# Create the reverse word index\n",
        "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n",
        "\n",
        "# Write out the embedding vectors and metadata\n",
        "out_v = io.open('vecs.tsv', 'w', encoding='utf-8')\n",
        "out_m = io.open('meta.tsv', 'w', encoding='utf-8')\n",
        "for word_num in range(1, vocab_size):\n",
        "  word = reverse_word_index[word_num]\n",
        "  embeddings = weights[word_num]\n",
        "  out_m.write(word + \"\\n\")\n",
        "  out_v.write('\\t'.join([str(x) for x in embeddings]) + \"\\n\")\n",
        "out_v.close()\n",
        "out_m.close()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "lqyV8QYnD46U"
      },
      "outputs": [],
      "source": [
        "# Download the files\n",
        "try:\n",
        "  from google.colab import files\n",
        "except ImportError:\n",
        "  pass\n",
        "else:\n",
        "  files.download('vecs.tsv')\n",
        "  files.download('meta.tsv')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "XUXAlNNk59gG"
      },
      "source": [
        "## Predicting Sentiment in New Reviews\n",
        "\n",
        "Below, we've again included some example new reviews you can test your results on."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "JbFTTcaK6Dan"
      },
      "outputs": [],
      "source": [
        "# Use the model to predict a review   \n",
        "fake_reviews = ['I love this phone', 'I hate spaghetti', \n",
        "                'Everything was cold',\n",
        "                'Everything was hot exactly as I wanted', \n",
        "                'Everything was green', \n",
        "                'the host seated us immediately',\n",
        "                'they gave us free chocolate cake', \n",
        "                'not sure about the wilted flowers on the table',\n",
        "                'only works when I stand on tippy toes', \n",
        "                'does not work when I stand on my head']\n",
        "\n",
        "print(fake_reviews) \n",
        "\n",
        "# Create the sequences\n",
        "padding_type='post'\n",
        "sample_sequences = tokenizer.texts_to_sequences(fake_reviews)\n",
        "fakes_padded = pad_sequences(sample_sequences, padding=padding_type, maxlen=max_length)           \n",
        "\n",
        "print('\\nHOT OFF THE PRESS! HERE ARE SOME NEWLY MINTED, ABSOLUTELY GENUINE REVIEWS!\\n')              \n",
        "\n",
        "classes = model.predict(fakes_padded)\n",
        "\n",
        "# The closer the class is to 1, the more positive the review is deemed to be\n",
        "for x in range(len(fake_reviews)):\n",
        "  print(fake_reviews[x])\n",
        "  print(classes[x])\n",
        "  print('\\n')\n",
        "\n",
        "# Try adding reviews of your own\n",
        "# Add some negative words (such as \"not\") to the good reviews and see what happens\n",
        "# For example:\n",
        "# they gave us free chocolate cake and did not charge us"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "name": "l09c05_nlp_tweaking_the_model.ipynb",
      "private_outputs": true,
      "provenance": [
        {
          "file_id": "1C0zoO8Q3QRxWeHX0RQBSkPyqhe-RN3Hg",
          "timestamp": 1585785927302
        }
      ],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
